Abstract: Few-shot learning assumes that only a very small dataset is available for each task and trains a model over a set of such tasks. In real-world problems, however, substantially more data are often available; we call this the medium-shot setting, where each task typically has several hundred samples. Despite their high accuracy, deep neural networks have the drawback of being black boxes, and learning interpretable models has become increasingly important. This study aims to obtain sample-based interpretability using the attention mechanism. The main idea is to reduce the training data of a task to a small number of support vectors using sparse kernel methods; the model then predicts the test data of the task from these support vectors. We propose a sparse medium-shot learning algorithm built on a metric-based Bayesian meta-learning algorithm whose output is probabilistic. Sparsity, together with uncertainty, plays a key role in interpreting the model's behavior. In our experiments, the proposed method provides interpretability by selecting a small number of support vectors while achieving accuracy competitive with other, less interpretable methods.
Keywords: Bayesian Meta-learning, Medium-shot Learning, Sample-based Interpretability, Sparse Kernel, Attention


1. Introduction 

Two approaches to deep learning have received particular attention so far. The first is deep learning on large datasets, which has been more successful than other machine learning methods in image, language, and signal processing [1]. Because it is difficult for humans to analyze a huge amount of data directly, deep neural networks are trained on it so that the information in the data can be exploited through the model. Deep learning requires a massive amount of data, but in most real-world problems the amount of labeled data is not enough to train a deep model. The second approach is few-shot learning [2]. It aims to make deep learning models behave more like humans, who learn new concepts well from only a few examples [3].

In few-shot learning, the assumption is that the amount of training data is very small. For example, in few-shot classification, the number of samples per class ranges between one and five. This assumption is at odds with the fact that in real-world problems we can often easily obtain more data for each task, or the user may even be able to provide a few hundred samples. Therefore, many practical problems, such as the classification of medical images [4] and time series prediction, naturally fall into the medium-shot setting. Medium-shot learning is an extension of few-shot learning in terms of the number of data. In recent years, meta-learning methods have shown remarkable performance in solving few-shot learning problems [5]. In this paper, we consider meta-learning methods for the medium-shot setting.

Deep neural networks have attracted widespread attention 
due to their ability to obtain high accuracies in various 
problems. However, there is a serious debate about them 
related to interpretability [6]: to what extent and on what 
basis can we trust the response of neural networks? Because 
the nature of deep networks is black-box, many methods 
have been proposed to interpret neural networks and their 
decision-making [7]. When a model has to make a decision, the user wants to know why it made that decision. A decision can be explained in different ways. One way is for the model to indicate which data its decision was based on. The user can then judge the quality of the decision by examining the samples that the model selected.

The medium size of the data in the medium-shot setting 
provides us with the possibility and opportunity of 
interpretation based on the evaluation of the entire training 
data of the task. Our goal is to train a model in such a way 
that it determines which data have a more important role in 
its decision-making, and we consider these data as support 
vectors. Our idea to achieve this kind of interpretability is to 
follow the perspective of attention in deep learning. We want 
to learn which data to pay more attention to. For this purpose, 
we present an interpretable meta-learning algorithm. We start 
our work with Deep Kernel Transfer (DKT), a metric-based 
meta-learning algorithm [8]. DKT is a Gaussian process with 
a deep kernel, so it combines the representational power of 
neural networks and the reliable uncertainty of Gaussian 
processes simultaneously. To implement the attention 
mechanism, we use sparse kernel methods and extend the 
DKT algorithm to the medium-shot setting. By sparsifying 
the expansion of the decision function, we can have sample- 
based interpretability with the selected data as support 
vectors. The resulting algorithm, Sparse DKT, reduces the 
data to a small number of support vectors for each task. In 
the Sparse DKT algorithm, only the support vectors at the 
test time directly influence the prediction of the test data 
label. The experimental results show that Sparse DKT, in 
addition to interpretability, has comparable accuracy to other 
state-of-the-art meta-learning methods, including the DKT 
algorithm. 

The main contributions of this article are: 

1. Introducing learning with the medium-shot setting and 
utilizing deep meta-learning algorithms for it; 

2. Learning a sample-based interpretable model using the 
attention mechanism; 

3. Applying sparse kernel methods for determining a small subset of training data as support vectors.

The rest of this article is organized as follows. Section 2 describes the basic concepts of meta-learning, interpretability, the attention perspective, and sparse kernels, together with related work. Section 3 presents the proposed algorithm. Section 4 evaluates the algorithm on classification. Section 5 gives the conclusion and future work.


2. Preliminaries 

2.1. Meta-learning 

Meta-learning is an area that has received considerable attention in recent years [9-34]. In classic learning, in order to learn a task, the model is trained on the task data in such a way that it generalizes well to new data. The objective of meta-learning, also known as learning to learn, is to go to a higher level and understand how to solve tasks rather than just learning a single task (Figure 1). Humans face different problems over time and develop better ways of dealing with new ones by drawing on their experience. Similarly, in meta-learning we train the model sequentially on a set of tasks from the same distribution. By completing each task, we acquire metadata that the model can use to learn a new, unseen task more effectively and quickly.




A. Meta-learning setup 

In meta-learning, as shown in Figure 2, instead of a single task we have a set of tasks, $M = \{D_t\}_{t=1}^{T}$, drawn from the same distribution. For each task, indexed by $t$, we have the data $D_t = \{X, y\}$, which is divided into two parts: the train/support set $D_t^{tr}$ and the test/query set $D_t^{ts}$. The data of a task used for meta-test is denoted by an asterisk, $D_* = (D_*^{tr}, D_*^{ts})$.


B. Few-shot learning 

Few-shot learning refers to tasks with only a few training data. For example, in few-shot classification denoted as N-way K-shot, N is the number of classes in the task, and K (usually 1 or 5) training samples are available for each class (Figure 2 shows 3-way 2-shot classification). Few-shot learning aims to make deep neural networks capable of learning a new concept by observing a small number of training samples. Such a small amount of training data makes it infeasible to train a deep neural network from scratch, but the meta-learning approach has achieved significant improvements in few-shot learning. Deep meta-learning learns a model that can solve a new task despite the small training set. Medium-shot learning is a generalization of few-shot learning, so we employ the same meta-learning framework.
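
As a rough illustration of the N-way K-shot setup, the following Python sketch draws one task from a labeled pool and splits it into a support set and a query set. The function name `sample_episode`, the `n_query` parameter, and the toy label pool are assumptions made for this example and are not taken from the experiments.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot task: returns (support, query) lists of (index, class)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Only classes with enough examples can take part in the task.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query]
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(i, c) for i in picked[:k_shot]]   # train/support set
        query += [(i, c) for i in picked[k_shot:]]     # test/query set
    return support, query

# Hypothetical pool: 10 classes with 80 examples each.
labels = [c for c in range(10) for _ in range(80)]
support, query = sample_episode(labels, n_way=2, k_shot=50, n_query=10,
                                rng=random.Random(0))
print(len(support), len(query))  # 100 20
```

Setting `k_shot` to 1 or 5 gives the usual few-shot regime, while values such as 50 or 125 correspond to the medium-shot setting considered here.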



Figure 1. Difference between a) learning and b) meta-learning. In learning, a model is trained on the data of one task in order to generalize to new data from the same dataset. In meta-learning, we train the model on a set of tasks sequentially. By learning to learn, we can solve a new task more efficiently and quickly.



Figure 2. An example of a meta-learning setup for few-shot learning. The set of tasks $M = \{D_t\}_{t=1}^{T}$ is divided into two parts, meta-train and meta-test. The data of each task has train and test sets, $D_t^{tr}$ and $D_t^{ts}$, respectively.

2.2. Interpretability in Deep Neural Networks

In deep learning, there are two main classes of approaches to 
explain the prediction of a model: feature-based and sample- 
based. In the feature-based approach, features from the input 
image that have a greater impact on the model's prediction 
are identified [35, 36]. The idea of [36] in few-shot learning 
has been applied in [37] to provide interpretable feature- 
based meta-learning. 

In the sample-based approach, the data that have the most 
impact on the network's decision-making for test data are 
identified as samples to interpret its prediction (Figure 3) 
[38, 39]. ProtoAttend [40] trains a network that compares the input with the training data using the attention mechanism and learns attention weights that express the degree of similarity between them. To interpret the model's decision for an input, the training data whose weights are non-zero, and which therefore affect the prediction, are selected as prototypes. Because deep learning datasets are large and comparing against all data is difficult, a subset of the data is typically chosen as a candidate set, and attention weights are learned only for this candidate set. In contrast, the number of data is not large in the medium-shot setting, and since we can evaluate all the training data, sample-based interpretability is possible. To achieve this, we take the attention point of view.



Figure 3. In sample-based interpretability, the training data that 
the model used to determine the label for the input data are 
specified. 


A. Sample-based interpretability through Attention 

Using the attention perspective, we can learn a model with sample-based interpretability [41]. Simply put, this means comparing the input data with the training data and giving greater weight to the training data that are more similar to the input when determining its label. To compare the data properly, we need to learn a metric space in which similar data are placed close together and dissimilar data are far apart. This approach is used in the metric-based meta-learning algorithms proposed for few-shot learning [14-16]. In these papers, since the number of training samples is small, there is no need for sample-based interpretability, and the main objective is to increase accuracy. Since we have more data in medium-shot learning, sample-based interpretability becomes important; in some applications, explaining the model's behavior with a small number of samples makes it easier for humans to understand and evaluate the model.


B. Attention and kernel methods 

The attention mechanism and kernel methods are closely related [42-45]. It can be said that the idea of attention in deep learning is derived from kernel methods [42]. Kernel methods use a kernel function $k(x, x')$ that determines the degree of similarity between two inputs [46]. Well-known kernel functions include the linear, polynomial, RBF (Radial Basis Function), and exponential kernels. Learning the kernel function corresponds to learning its parameters; e.g., in the RBF kernel

$$k(x, x') = s \exp\left(-\frac{1}{2 l^2} \|x - x'\|^2\right)$$  (1)

the parameters $\phi = (l, s)$ are learned during training.
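
The link between kernels and attention can be made concrete with a small NumPy sketch. It evaluates an RBF kernel of the form in Equation (1) and uses the normalized kernel values as attention weights over the training samples. The normalization (a Nadaraya-Watson style weighted average) is an illustrative choice made here and is not the predictor used in this paper; the function names and toy data are likewise assumptions.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, scale=1.0):
    """RBF kernel k(x, x') = s * exp(-||x - x'||^2 / (2 l^2)) for all row pairs."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return scale * np.exp(-sq_dist / (2.0 * lengthscale ** 2))

def kernel_attention(x_query, X_train, y_train, lengthscale=1.0):
    """Predict a label as a similarity-weighted average of the training labels."""
    k = rbf_kernel(x_query[None, :], X_train, lengthscale)[0]
    weights = k / k.sum()            # attention weights over training samples
    return weights @ y_train, weights

# Toy 1-D data (hypothetical).
X_train = np.array([[0.0], [1.0], [3.0]])
y_train = np.array([0.0, 1.0, 3.0])
pred, weights = kernel_attention(np.array([0.9]), X_train, y_train)
print(weights.argmax())  # 1: the training point closest to the query gets the most weight
```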

In deep kernel learning (DKL) [47-50], we first use a deep neural network to obtain data representations and then apply a kernel function to them. The resulting deep kernel is

$$k(x, x') = k_\phi(f_\theta(x), f_\theta(x'))$$  (2)

where $k_\phi$ is the kernel function with parameters $\phi$ and $f_\theta$ is a deep neural network with weights $\theta$. DKL learns the kernel and network parameters jointly. For example, in regression on data $\{(x_i, y_i)\}_{i=1}^{N}$ with noise variance $\sigma^2$, the parameters are optimized using the log marginal likelihood,

$$\log p(y \mid X) = -\frac{1}{2} \left( y^T [K + \sigma^2 I]^{-1} y + \log|K + \sigma^2 I| + N \log(2\pi) \right)$$  (3)

where $K$ is the kernel matrix on the training data.
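
As a minimal sketch of how the log marginal likelihood of Equation (3) can be evaluated for a deep kernel, the NumPy code below uses a single tanh layer as a stand-in for the network $f_\theta$ and a linear base kernel. The one-layer feature map, the random weights, and the function names are assumptions for illustration; they are not the architecture or implementation used in the experiments.

```python
import numpy as np

def feature_map(X, W):
    """Stand-in for the deep network f_theta: one tanh layer with weights W."""
    return np.tanh(X @ W)

def deep_linear_kernel(X1, X2, W):
    """Deep kernel: a linear base kernel applied to the learned features."""
    return feature_map(X1, W) @ feature_map(X2, W).T

def log_marginal_likelihood(K, y, noise_var):
    """GP log marginal likelihood, evaluated through a Cholesky factor."""
    n = y.shape[0]
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma^2 I)^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))             # log|K + sigma^2 I|
    return -0.5 * (y @ alpha + log_det + n * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
W = rng.normal(size=(3, 8))      # hypothetical network weights
y = rng.normal(size=20)
K = deep_linear_kernel(X, X, W)
lml = log_marginal_likelihood(K, y, noise_var=0.1)
print(lml)
```

In DKL this quantity would be maximized jointly over the kernel and network parameters by gradient ascent; the sketch only evaluates it for fixed values.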


2.3. Deep Kernel Transfer 

Deep Kernel Transfer (DKT) falls into the category of metric-based meta-learning [8]. This class of algorithms learns a metric space in which representations are compared using a distance measure [14-16]. DKT combines MAML (Model-Agnostic Meta-Learning) and DKL for few-shot learning. MAML [21] builds on the idea of [13] but, instead of using an additional model as a meta-learner, learns a meta-parameter that serves as an initialization for the parameters of the network. The meta-parameter adapts quickly to the data of a new task without overfitting to the small amount of training data.

The computational graph of MAML is shown in Figure 4a. Using SGD (Stochastic Gradient Descent) optimization on the task training data, the MAML algorithm obtains the task-specific parameter $\phi_t$ from the meta-parameter $\theta$. The inner loop (adaptation loop) of MAML has a parametric form, so in the outer loop the gradient with respect to $\theta$ must be propagated through the optimization path of the inner loop, which introduces second-order derivatives.
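
To make the second-order term explicit, the MAML updates can be written out for a single inner SGD step. The step sizes $\eta$ and $\beta$ and the loss symbol $\mathcal{L}$ are notation assumed here for illustration; $\phi_t$ denotes the task-specific parameter and $\theta$ the meta-parameter.

```latex
% Inner loop (adaptation) on the task training data, one SGD step:
\phi_t = \theta - \eta \, \nabla_{\theta} \mathcal{L}\big(\theta; D_t^{tr}\big)

% Outer loop (meta-update) on the task test data:
\theta \leftarrow \theta - \beta \, \nabla_{\theta} \sum_t \mathcal{L}\big(\phi_t; D_t^{ts}\big)

% The chain rule through the inner step exposes the Hessian of the training loss:
\nabla_{\theta} \mathcal{L}\big(\phi_t; D_t^{ts}\big)
  = \Big(I - \eta \, \nabla_{\theta}^{2} \mathcal{L}\big(\theta; D_t^{tr}\big)\Big)
    \nabla_{\phi_t} \mathcal{L}\big(\phi_t; D_t^{ts}\big)
```

Dropping the Hessian term gives the first-order approximation of MAML.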

The idea of DKT is to replace the inner-loop computation with a Gaussian process, which has a non-parametric form. Therefore, as shown in Figure 4b, adaptation to the task is eliminated. As in DKL, a Gaussian process is applied to the representations. DKT computes the marginal likelihood (3) on the data of each task and optimizes the parameters $\theta$ and $\phi$. By meta-learning a deep kernel on a set of tasks, we obtain a kernel that can be transferred to a new task without adaptation. By replacing the inner loop with a Gaussian process, the DKT algorithm simplifies the computation of MAML. Furthermore, it can be regarded as a Bayesian meta-learning method. In few-shot regression and image classification, DKT has achieved higher accuracy than MAML and other few-shot learning methods.


2.4. Sparse kernel methods 

SVM (Support Vector Machine) is a popular sparse kernel method [51]. The sparsity of SVM results from the coefficients $\alpha$ of part of the data being driven to zero during the quadratic optimization, which determines a subset of the data as support vectors. In few-shot learning, MetaOptNet [30] uses SVM to simplify the inner loop of MAML, obtaining the task-specific parameter without SGD optimization and avoiding second derivatives in the meta-parameter optimization (Figure 4c).

Figure 4. Computational graph of a) MAML, b) DKT, c) MetaOptNet, and d) Sparse DKT (ours). In a), adapting to the task is equivalent to obtaining the task-specific parameter $\phi_t$. In b), the meta-parameters $\theta$ and $\phi$ are updated, without adapting to the task, based on the marginal likelihood of the Gaussian process on the entire data. In c), the task-specific parameter $\phi_t$ is computed by applying SVM to the training data of the task. In d), adapting to the task is equivalent to specifying the support vectors, a small subset of the task training data.


The disadvantage of SVM in the MetaOptNet algorithm is that it becomes less effective at sparsifying as the amount of data increases. Another disadvantage of SVM compared to the Gaussian process [52] is that it is not probabilistic. The Gaussian process, in contrast, is not inherently sparse; at test time, the kernel is evaluated between the test data and all $n$ training data. Several sparse approximations have been proposed to reduce the computational and memory complexity of the Gaussian process [53, 54]. Almost all of these approximation methods specify a criterion for the significance of the data and greedily select a subset of size $m \ll n$ to be used in the kernel matrix approximation. The main goal of the methods in [55-59] is to reduce the computational complexity of the Gaussian process by assuming that there is a set of support vectors. The criteria used to determine the support vectors in these methods usually govern adding data to this set, so the number of support vectors is a fixed hyperparameter. However, since these vectors are supposed to have the most impact on the model's prediction, we want a small number of support vectors to be selected automatically while maintaining high accuracy. Additionally, in medium-shot learning, the number of data selected as support vectors should depend on the task. Therefore, in the proposed algorithm, which aims to achieve sample-based interpretability using Gaussian processes, we leverage the sparse Bayesian approach, which we explain in the following section.


3. Sparse DKT for medium-shot learning 

This section presents our meta-learning algorithm, Sparse DKT, for medium-shot learning. To achieve sample-based interpretability, we need to determine how important each training sample is for modeling the data and for prediction. We measure this importance with the kernel function, so we build on DKT. We modify this algorithm to attain sample-based interpretability and apply attention to it in two ways: attention in adaptation and attention in prediction. Attention in adaptation is independent of the test data and is performed only on the training data. The sparse Gaussian process is trained on the task data; in other words, it adapts to the task, and the result of this adaptation is the identification of the support vectors.

In contrast, attention in prediction depends on the test data but uses only the support vectors rather than the entire training data. Because we use Gaussian processes, attention in prediction is already built in: each support vector affects the predicted label of a test point according to its similarity to that point. We discuss the proposed algorithm for regression, but it can be easily generalized to classification.


3.1. Sparse Gaussian process as Adaptation

In the sparse Bayesian learning framework, Tipping introduced the RVM (Relevance Vector Machine) algorithm [60]. We adopt this algorithm because it automatically selects the data that play the main role in modeling the data when adapting to the task.

This algorithm is essentially a Gaussian process. Assume that we have data $D = (X, y)$, consisting of the inputs $X = \{x_i\}_{i=1}^{n}$ and the labels $y = \{y_i\}_{i=1}^{n}$. The labels are corrupted by Gaussian noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ added to the latent function $f(x)$, according to $y(x_i) = f(x_i) + \epsilon_i$. The prior on the function $f(x)$ is a Gaussian process $GP(\mu, k_\phi)$ with mean $\mu$ and kernel function $k_\phi$. The mean is usually taken to be zero.

We can rewrite the latent function $f$ in the parametric form $f = Kw$ in the equation $y = f + \epsilon$, where $K$ is the covariance matrix based on the kernel function $k_\phi(x, x')$. In the Gaussian process, the weight $w$ has the Gaussian distribution $\mathcal{N}(0, \alpha_0^{-1} I)$, where $\alpha_0$ is a hyperparameter. In RVM, the Gaussian distribution $p(w \mid \alpha) = \mathcal{N}(0, A^{-1})$ is placed on the weights, where $A = \mathrm{diag}(\alpha)$ is a diagonal precision matrix with one hyperparameter $\alpha_i$ per weight. As a result, RVM is a Gaussian process with kernel function

$$c(x, x') = \sum_{i=1}^{n} \alpha_i^{-1} k_\phi(x, x_i) k_\phi(x', x_i)$$  (4)

where $k_\phi(x, x_i)$ is the kernel function centered on the training point $x_i$. $c(x, x')$ is an expansion over products of the kernel values $k_\phi(x, x_i)$ to which all training data contribute. Whether the kernel function of a training point is included in the expansion is determined by its coefficient $\alpha_i$. When $\alpha_i$ goes to infinity, the kernel function corresponding to $x_i$ is removed; as a result, the expansion $c(x, x')$ becomes sparse. The covariance matrix of RVM can be expressed as $C = K A^{-1} K^T$.
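
The step from the weight prior to the covariance matrix $C$ can be spelled out as a short derivation. It uses only the definitions above; the symbol $k_i$ for the i-th column of $K$ is introduced here for convenience.

```latex
% Prior covariance of f = K w with w ~ N(0, A^{-1}):
C = \mathbb{E}\big[f f^{T}\big]
  = K \, \mathbb{E}\big[w w^{T}\big] K^{T}
  = K A^{-1} K^{T}
  = \sum_{i=1}^{n} \alpha_i^{-1} \, k_i k_i^{T},
\qquad [k_i]_j = k_\phi(x_j, x_i)

% Entry (j, l) recovers the kernel c evaluated on two training points:
C_{jl} = \sum_{i=1}^{n} \alpha_i^{-1} \, k_\phi(x_j, x_i) \, k_\phi(x_l, x_i) = c(x_j, x_l)

% As \alpha_i \to \infty, the i-th rank-one term vanishes and x_i drops out of the model.
```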

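Next, using Bayes' rule (5) with the likelihood $p(y \mid w) = \mathcal{N}(f, \sigma^2 I)$,

$$p(w \mid \alpha, y) = \frac{p(y \mid w)\, p(w \mid \alpha)}{p(y \mid \alpha)}$$  (5)

we obtain the posterior distribution of the weights,

$$p(w \mid \alpha, y) = \mathcal{N}(\mu, \Sigma), \quad \mu = \sigma^{-2} \Sigma K^T y, \quad \Sigma = (A + \sigma^{-2} K^T K)^{-1}$$  (6)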
RVM training is similar to Gaussian process training: we optimize the logarithm of the marginal likelihood (7) with respect to the hyperparameters $\alpha$ and $\sigma^2$.

$$p(y) = \mathcal{N}(0, C + \sigma^2 I)$$

$$\log p(y) = -\frac{1}{2} \left\{ y^T [C + \sigma^2 I]^{-1} y + \log|C + \sigma^2 I| + n \log 2\pi \right\}$$  (7)
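Differentiating Equation 7 with respect to $\alpha$ and $\sigma^2$ and setting the derivatives to zero gives the following update equations:

$$\alpha_i^{new} = \frac{\gamma_i}{\mu_i^2}, \quad \gamma_i = 1 - \alpha_i \Sigma_{ii}, \quad (\sigma^2)^{new} = \frac{\|y - K\mu\|^2}{n - \sum_i \gamma_i}$$  (8)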

where $\Sigma_{ii}$ is the i-th diagonal component of the covariance matrix $\Sigma$ in (6), and $\gamma_i \in [0, 1]$ indicates how much the data contributed to the determination of $w_i$. To obtain $\alpha$ and $\sigma^2$, we can use an iterative algorithm. During training, many $\alpha_i$ become infinite, which drives the mean and variance of the corresponding weights to zero. When the weight $w_i$ becomes zero, the kernel function at $x_i$ does not contribute to describing the data, so it can be removed from the model. The data with non-zero weights are considered support vectors. Another way to train RVM is the Expectation-Maximization algorithm [61]. In this study, we use the sequential algorithm proposed in [62]. (The authors of [62] published their code in MATLAB, and we re-implemented it in Python: http://www.miketipping.com/sparsebayes.htm) In this algorithm, the set of support vectors is initially empty, and important data are added to it sequentially. This incremental construction significantly reduces the computational cost of RVM, which makes it better suited to learning in the medium-shot setting.
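
The re-estimation equations (6) and (8) can be sketched in NumPy as follows. This is the simple iterative variant with pruning of large $\alpha_i$, not the sequential algorithm of [62] used in this study. The function name `rvm_fit`, the pruning threshold, the initial noise variance, the numerical guards, and the toy sine data are all assumptions made for illustration.

```python
import numpy as np

def rvm_fit(K, y, n_iter=300, prune_at=1e6):
    """Iterative RVM re-estimation with pruning of basis functions whose alpha_i is large.

    K is the n x n kernel matrix on the task training data and y holds the n labels.
    Returns the indices of the retained support vectors, their posterior mean
    weights, and the estimated noise variance.
    """
    n = K.shape[0]
    keep = np.arange(n)                  # basis functions still in the model
    alpha = np.ones(n)                   # one precision hyperparameter per training point
    noise_var = 0.1 * np.var(y) + 1e-6   # initial guess for sigma^2 (assumption)
    for _ in range(n_iter):
        Kk = K[:, keep]
        # Posterior covariance and mean of the weights, Eq. (6).
        Sigma = np.linalg.inv(np.diag(alpha[keep]) + Kk.T @ Kk / noise_var)
        mu = Sigma @ Kk.T @ y / noise_var
        # Hyperparameter updates, Eq. (8), with small guards against division by zero.
        gamma = np.clip(1.0 - alpha[keep] * np.diag(Sigma), 0.0, 1.0)
        alpha[keep] = gamma / (mu ** 2 + 1e-12)
        resid = y - Kk @ mu
        noise_var = max(resid @ resid / max(n - gamma.sum(), 1e-6), 1e-8)
        # Prune basis functions whose alpha_i has effectively diverged.
        small = alpha[keep] < prune_at
        if not small.any():              # never prune every basis function
            small[np.argmin(alpha[keep])] = True
        keep = keep[small]
    # Recompute the posterior mean for the final set of support vectors.
    Kk = K[:, keep]
    Sigma = np.linalg.inv(np.diag(alpha[keep]) + Kk.T @ Kk / noise_var)
    mu = Sigma @ Kk.T @ y / noise_var
    return keep, mu, noise_var

# Toy regression task (hypothetical): noisy sine with an RBF kernel of lengthscale 1.
rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 40)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
sv, mu, noise_var = rvm_fit(K, y)
print(len(sv), "support vectors out of", x.size)
```

The indices in `sv` play the role of the support vectors; how many survive depends on the data, the kernel, and the pruning threshold.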


3.2. Sparse DKT algorithm 

The Sparse DKT algorithm uses RVM as the inner loop. On the one hand, this is a simplification of MAML; on the other hand, it adds interpretability to DKT. According to Figure 4, the difference between DKT and Sparse DKT is the addition of the adaptation loop. Unlike in MetaOptNet, in Sparse DKT the parameters of the kernel function are part of the meta-parameters and are updated using the loss of each task. Pseudocode for Sparse DKT is given in Algorithm 1. In meta-training, the purpose of running RVM as the inner loop of Sparse DKT is to obtain $\alpha$. We are interested in learning which data are most important in describing the whole data and, consequently, in the model's prediction. The Sparse DKT algorithm selects the data whose coefficient $\alpha_i$ is finite as the support vectors of the task. In the outer loop, they are used in the optimization with the RVM marginal likelihood (7).


Algorithm 1. Sparse Deep Kernel Transfer (Sparse DKT)

Require: $M = \{D_t\}_{t=1}^{T}$ meta-train tasks
Require: $\phi$ kernel hyperparameters, $\theta$ neural network weights
Require: $\beta_1$, $\beta_2$ step sizes

while not done do
    Sample $D_t$ from $M$
    SV = RVM($D_t$)    // Obtain support vectors of $D_t$ with RVM
    // Use marginal likelihood to update parameters
    $L_t = -\log p(y \mid X, \phi, \theta)$    // Eq. (7)
    $\phi \leftarrow \phi - \beta_1 \nabla_{\phi} L_t$,  $\theta \leftarrow \theta - \beta_2 \nabla_{\theta} L_t$
end while

function RVM($D$)
    // Automatically select support vectors of the dataset $D$
    Initialize $\alpha$ and $\sigma^2$
    while not converged do
        Update $\mu$ and $\Sigma$    // Eq. (6)
        Update $\alpha$ and $\sigma^2$    // Eq. (8)
    end while
    return support vectors from $D$ with finite $\alpha_i$ values
end function

At meta-test time, for the test task with data $D_*^{tr} = (X, y)$ and $D_*^{ts}$, the support vectors of the task are first selected from the training data $D_*^{tr}$ by running RVM. In addition to the support vectors, the mean and covariance of the posterior weight distribution are also obtained, which we use in the RVM predictive distribution,

$$p(y_* \mid X, y, x_*) = \mathcal{N}(\mu_*, \sigma_*^2)$$

$$\mu_* = k_*^T \mu, \quad \mu = \sigma^{-2} \Sigma K_{mn} y$$

$$\sigma_*^2 = \sigma^2 + k_*^T \Sigma k_*, \quad \Sigma = (A + \sigma^{-2} K_{mn} K_{nm})^{-1}$$  (9)

where $k_*$ is the covariance between $x_* \in D_*^{ts}$ and the $m$ support vectors, and $K_{mn}$ is the covariance between the support vectors and the training data.
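
A minimal NumPy sketch of the predictive distribution in Equation (9) is given below. The function name `rvm_predict`, the hand-picked support vector indices, the unit values of $\alpha$, and the toy data are assumptions for illustration; in Sparse DKT the support vectors and $\alpha$ would come from running RVM on the task.

```python
import numpy as np

def rvm_predict(k_star, K_mn, y, alpha_sv, noise_var):
    """Predictive mean and variance from m support vectors.

    k_star:   (m,)   kernel values between the test point and the support vectors
    K_mn:     (m, n) kernel values between the support vectors and all training data
    alpha_sv: (m,)   precision hyperparameters of the support vectors
    """
    Sigma = np.linalg.inv(np.diag(alpha_sv) + K_mn @ K_mn.T / noise_var)
    mu = Sigma @ K_mn @ y / noise_var
    mean = k_star @ mu
    var = noise_var + k_star @ Sigma @ k_star
    return mean, var

def rbf(a, b):
    """RBF kernel with unit lengthscale on 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

x = np.linspace(-3.0, 3.0, 20)
y = np.sin(x)
sv_idx = np.array([3, 10, 16])                 # hypothetical support vectors
K_mn = rbf(x[sv_idx], x)
k_star = rbf(np.array([0.5]), x[sv_idx])[0]
mean, var = rvm_predict(k_star, K_mn, y, alpha_sv=np.ones(3), noise_var=0.01)
print(mean, var)
```

Only the three selected training points enter `k_star`, so the test prediction can be traced back to them directly, which is the basis of the sample-based interpretation.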

4. Experiments

We run classification experiments in a medium-shot setting, using datasets common in few-shot learning, to evaluate the Sparse DKT algorithm. The number of samples per class was chosen so that the tasks fall outside the few-shot regime. We used PyTorch and GPyTorch [63] to implement Sparse DKT.

To compare Sparse DKT with DKT, Feature Transfer, MAML, and MetaOptNet, we considered the Omniglot, CUB-200, and miniImageNet datasets for image classification (Figure 5).

Figure 5. Images from datasets used in classification


Omniglot consists of the alphabets of 50 languages and has 20 handwritten samples for each character. CUB-200 contains 200 classes of different bird species. miniImageNet has 100 classes, which are a subset of the ImageNet classes, and each class has 600 images. We run 2-way and 5-way classification tests. As in the DKT paper, classification is done one-versus-rest (Figure 6); i.e., for each class, we consider a binary Gaussian process model with labels {-1, 1} and apply the sigmoid function to its output in order to have a probabilistic interpretation (for MetaOptNet, we also used binary SVMs for multi-class classification in the experiments). The model whose output has the highest probability determines the class of the test data. We used a linear kernel in the experiments and a deep neural network with an architecture similar to the network used in the DKT paper (Figure 7).
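
The one-versus-rest decision rule can be sketched in a few lines of Python. The function names and the example scores are hypothetical; each score stands for the output of one binary model for the same test sample.

```python
import math

def sigmoid(z):
    """Map a real-valued model output to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def one_vs_rest_predict(scores):
    """Pick the class whose binary model assigns the highest probability."""
    probs = [sigmoid(s) for s in scores]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Outputs of three binary models for one test sample (hypothetical values).
label, probs = one_vs_rest_predict([-1.2, 0.8, -0.3])
print(label)  # 1
```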

In Feature Transfer, a network and a classifier are first trained on samples from the training classes. During fine-tuning, the network parameters are fixed, and a new classifier is trained on the test classes. The accuracy of MAML depends on the number of gradient steps in the inner loop and is low when only a few steps are used. Increasing the number of gradient steps also increases computation and memory consumption. To be able to test MAML with 10 adaptation steps, we used its first-order approximation [28]. Table 1 shows the results of Omniglot 5-way 15-shot classification.


Figure 6. One-versus-rest scheme. Each model is a binary classifier for input data with labels {-1, 1}. For a probabilistic output, a sigmoid function $\sigma$ is applied to it.

Figure 7. The CNN used as a backbone for classification. It consists of 4 convolutional layers, each consisting of a 2D convolution, a batch-norm layer, and a ReLU non-linearity.


Table 1. Average accuracy and standard deviation on Omniglot classification, with the average number of support vectors (SVs)

Method           | Omniglot 5-way 15-shot | SVs
Feature Transfer | 99.36±0.08             | -
MAML             | 95.80±0.312            | -
DKT              | 99.52±0.211            | 75
MetaOptNet       | 99.46±0.141            | 13
Sparse DKT       | 99.33±0.1              | 6


On Omniglot, the accuracy of Sparse DKT is within about 0.2 percentage points of DKT and MetaOptNet, while it uses far fewer support vectors: 6 on average, compared with 13 for MetaOptNet and 75 for DKT, which uses all training data of the 5 classes as support vectors for its prediction. MAML can achieve higher accuracy at the cost of more adaptation steps. Table 2 shows the classification results on CUB and miniImageNet. Due to limited computational resources, we ran only 2-way classification on these datasets. The number of training data per class is 50 for CUB and 125 for miniImageNet, i.e., 100 and 250 training data per task, respectively. Feature Transfer overfits in the few-shot setting; on CUB and miniImageNet in the medium-shot setting, however, it obtained higher accuracy than the other methods in our experiments. We believe that the accuracy of Feature Transfer decreases when the classes of the new task diverge more from the training classes. We leave further investigation to future work.

On CUB and miniImageNet, Sparse DKT is more interpretable and has higher accuracy than MetaOptNet, with a smaller number of support vectors. The ability of MetaOptNet to sparsify decreases as the training data increase, due to the weakness of SVM. In miniImageNet classification, the proposed method selected 14 support vectors on average from 250 data, while MetaOptNet selected 76 support vectors. Additionally, the experiments on these different datasets show that the number of support vectors for each application depends on intra-class and inter-class similarity. The metric space learned by Sparse DKT to separate classes affects the number of support vectors.

In Figure 8, we give an example of testing the models trained with Sparse DKT and DKT on a 2-way 50-shot classification task from the CUB meta-test dataset. In this task, Sparse DKT has the same accuracy as DKT. In Figure 9, the task training data are shown, and the support vectors are marked with a red line around the image.


Journal of Computer and Knowledge Engineering, Vol. 5, No. 2, 2022.


Table 2. Average accuracy and standard deviation on CUB and miniImageNet classification, with the average number of support vectors (SVs)

Method           | CUB 2-way 50-shot | SVs | miniImageNet 2-way 125-shot | SVs
Feature Transfer | 95.23±0.381       | -   | 93.13±0.530                 | -
MAML             | 92.33±1.069       | -   | 85.63±0.176                 | -
DKT              | 93.98±0.448       | 100 | 92.0±0.4                    | 250
MetaOptNet       | 92.27±1.313       | 33  | 89.70±0.56                  | 76
Sparse DKT       | 93.75±0.909       | 21  | 91.08±0.913                 | 14


Figure 9. Sample-based interpretability of Sparse DKT in CUB 2-way 50-shot. Support vectors of the two classes (a, b), highlighted with a red square, are the basis of the model's prediction.

Figure 10. Comparing kernels of a) Sparse DKT and b) DKT: a) the most similar support vectors to the test image, b) the most similar training data to the test image. The green line above the images on the right shows that they have the same label as the test image.


We compared the learned kernels of Sparse DKT and DKT in Figure 10. For Sparse DKT, the similarity of the test images to the support vectors is computed. In each row, the images are sorted from most to least similar, left to right. The green and red lines on top of the images on the right show whether their labels are the same as or different from that of the test image. The vertical green line on the test image indicates that the model predicted its label correctly. The Sparse DKT kernel detects similarity well, even though the number of support vectors is very small.


5. Conclusion

In this study, we introduced medium-shot learning as a generalization of few-shot learning for real-world applications. Interpretability in deep learning models is becoming increasingly important, especially in sensitive scenarios, and in medium-shot learning sample-based interpretability can be obtained by reducing the data to a small number of support vectors. We considered sparse kernel methods from an attention-based perspective to obtain sample-based interpretability. The proposed Sparse DKT algorithm leverages sparse Gaussian processes in the meta-learning framework and selects the most important training data as support vectors. At test time, it makes its predictions based on these support vectors.

The impact of the marginal likelihood on the trade-off between accuracy and the number of support vectors, as well as the impact of more task training data, is one of the key areas for future work. Using improved versions of RVM [64, 65] may increase the accuracy of Sparse DKT. Since SVM in MetaOptNet becomes less effective at sparsifying when the data increase, GLASSO [66], which also has a probabilistic solution, could be used as an alternative to SVM in MetaOptNet. Another direction for future work is investigating variational sparse Gaussian processes [67-70], which use variational inference to maximize a lower bound on the marginal likelihood. Point processes [71] could be combined with them to determine the support vectors in sparse variational Gaussian processes.